Abstract

Continual learning aims to learn new tasks without forgetting previously learned ones. In practice, most existing artificial neural network (ANN) models fail at this, whereas humans manage it by remembering earlier experience throughout their lives. Although simply storing all past data can alleviate the problem, it requires a large amount of memory and is often infeasible in real-world applications where access to past data is limited. We hypothesize that a model that learns to solve each task continually has some task-specific properties and some task-invariant characteristics. To address these issues, we propose a hybrid continual learning model, better suited to real-world scenarios, that has a task-invariant shared variational autoencoder and T task-specific variational autoencoders. Our model combines generative replay and architectural growth to prevent catastrophic forgetting. We show that our hybrid model effectively avoids forgetting and achieves state-of-the-art results on visual continual learning benchmarks such as MNIST, Permuted MNIST (QMNIST), CIFAR100, and miniImageNet. We also discuss results on a few more datasets: SVHN, Fashion-MNIST, EMNIST, and CIFAR10. Our code is available at https://github.com/DVAESCL/DVAESCL.


1. Introduction 


Humans and animals are capable of continually learning and updating knowledge throughout their lifetime. The ability to accommodate new experiences while retaining previously acquired knowledge helps to build reusable artificial intelligent systems. Current ANNs achieve impressive performance on many machine learning problems such as image classification, object detection, and natural language processing, but fail to remember previous knowledge due to a phenomenon called catastrophic forgetting[28] when trained on new tasks. We want our artificial learning agents to be able to solve many tasks sequentially under different conditions by developing task-invariant and task-specific skills that enable them to adapt quickly while avoiding forgetting through generative replay. Several approaches have been proposed over the years to alleviate catastrophic forgetting. The first approach dynamically increases the network's capacity to learn new tasks[44, 34]. The second approach uses regularizers that keep the number of parameters constant during sequential learning[1, 17, 22, 23, 46]. However, these approaches are not feasible if each task needs a large memory. Another line of work relies either on experience replay[4, 24, 31], which stores real data from previous tasks, or on generative replay[33, 37, 40], which trains generative models.

Figure 1. Our model at training time. Architecture growth occurs at the arrival of the $t^{th}$ task by adding a task-specific VAE, denoted $P^t$, and a task-specific ANN, denoted $p^t$. To prevent forgetting, private VAEs are stored for each task. The first private VAE is trained using real data from the first task. The second private VAE is trained using real data from the second task and synthesized data corresponding to the first task's classes generated from the first private decoder. Similarly, during training the $t^{th}$ private VAE sees real data from the $t^{th}$ task and synthesized data from the previous private decoders corresponding to their tasks' classes. The shared VAE is less prone to forgetting, yet it is also retrained with a small number of generative replay samples. The plus (+) sign indicates concatenation.

Figure 2. Our model at test time for the $t^{th}$ task. The model receives data corresponding to all classes up to the $t^{th}$ task and predicts those classes.

We propose a novel continual learning method in which the network is dynamic, uses generative replay, and grows in size with each task. Its disjoint representation comprises a task-invariant (shared) space of fixed size, trained for all tasks, and a task-specific (private) space that grows with each task. Our approach is motivated by the fact that the human brain has a complex structure and contains billions of neurons[12]. We may need to move towards more complex neural network structures in the future for artificial systems to solve the tasks that humans can solve today. Continual learning can be applied to various practical situations involving privacy issues. The main contributions of this work are summarized as follows:


• We develop a structure-based and generative replay-based model using (T + 1) conditional variational autoencoders, where T is the number of tasks the model has to solve.

• We show that the training of the private and shared modules differs from that of[11], although that paper inspires our network's architecture.

• We present results on several datasets and show that our model achieves state-of-the-art performance on all of them.


2. Related Work 
2.1. Continual Learning 


The existing continual learning approaches can be di- 
vided into architecture-based, regularization-based, and 
rehearsal-based strategies. 


Architecture-based Methods 


The first approach to preventing catastrophic forgetting modifies the network's structure by growing a module for each task, either physically or logically[44, 23]. These methods attempt to localize inference to a subset of the network such as columns[26], neurons[10, 45], or a mask over parameters[27, 36]. The performance on learned tasks is preserved by storing the learned modules, while new tasks are accommodated by augmenting the network with new modules. PNNs[35] statically grow the architecture, are immune to forgetting, and utilize prior knowledge via lateral connections to previously learned features. Reference[45] proposed a dynamically expandable network (DEN) that can dynamically decide its capacity as it is trained on a sequence of tasks. These methods impose a growing computational cost in continual learning settings where many tasks need to be learned, and they cannot operate under a fixed memory capacity.


Regularization-based Methods 


The second family of methods is based on regularization. These methods estimate the importance of a network's parameters and penalize changes to the important weights when switching from one task to another. There are many existing approaches for penalizing the weights. One of them is elastic weight consolidation (EWC)[18], where the important parameters are those with the highest values in terms of the Fisher information matrix. In reference[46], the importance weights are computed online by keeping track of how much the loss changes due to a change in specific weights, and this information is incorporated during training. Reference[1] focuses on the change in the activations instead of the change in the loss. This way, parameter importance is learned in an unsupervised manner. Despite the success of these methods, they are often limited by the number of tasks.
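As a rough illustration of the regularization idea, the sketch below penalizes the drift of each parameter away from its value after the previous task, weighted by an importance score (for EWC, a diagonal Fisher information estimate). The function name, the `strength` coefficient, and the toy numbers are illustrative assumptions and are not taken from[18].

```python
def quadratic_penalty(params, old_params, importance, strength=1.0):
    """Importance-weighted quadratic penalty on parameter drift.

    params, old_params, importance: equal-length lists of floats.
    importance[i] stands in for a diagonal Fisher estimate (assumed given).
    """
    if not (len(params) == len(old_params) == len(importance)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for p, p_old, f in zip(params, old_params, importance):
        total += f * (p - p_old) ** 2
    return 0.5 * strength * total


# Toy example: moving the parameter with high importance costs more.
old = [1.0, 1.0]
fisher = [10.0, 0.1]
penalty_important = quadratic_penalty([2.0, 1.0], old, fisher)
penalty_unimportant = quadratic_penalty([1.0, 2.0], old, fisher)
```

In a training loop this term would be added to the loss of the new task, so that parameters judged important for earlier tasks change less.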


Rehearsal-based Methods 


The final family of methods for mitigating forgetting is rehearsal-based. Existing approaches use two strategies: either store a few samples per class from previous tasks, or train a generative model such as a GAN[38] or a VAE[14] (or both) to sample synthetic data from previously learned distributions. iCaRL[31] stores a subset of real data (exemplars). For a constant memory budget, the number of samples stored per learned class decreases as the number of tasks increases, so the model's performance decays. Reference[13] proposed two losses, the less-forget constraint and inter-class separation, to prevent forgetting. The less-forget loss minimizes the cosine distance between the features extracted from the old and new models. References[41, 3] introduce a bias-correction layer that corrects the original fully-connected layer's output to address the data imbalance between the old and new categories. Recent methods that study tiny episodic memories in continual learning include GEM[25], A-GEM[21], MER[32], and ER-RES[5].

The second strategy in this family does not store any data but generates synthetic data using generative models. Reference[37] used generative replay with an unconditional GAN, where an auxiliary classifier is needed to determine which classes the generated samples belong to. Reference[40] is an improved version of[37] that uses a class-conditional GAN to synthesize data. Reference[15] used a generative autoencoder for replay. Synthetic data for previous tasks are generated based on the mean and covariance matrix using the encoder's class statistics. The major limitation of these approaches is the assumption of a Gaussian distribution of the data.


2.2. Space Factorization 


In machine learning, multi-view learning has been reported to be more effective and to generalize better than single-view learning[43]. Approaches to multi-view learning aim either at maximizing the mutual agreement between different views of the data or at obtaining a subspace shared by many views, assuming that the input views are generated from that subspace, using clustering[6], Gaussian processes[39], etc. The concept of factorizing the space into shared and private subspaces has therefore been explored[9]. In this paper, we factorize the data space into two parts: shared and private.


3. Shared and Private VAEs with Generative 
Replay for Continual Learning 


We study the problem of learning a sequence of $T$ data distributions denoted as $\mathcal{D} = \{D^1, D^2, \ldots, D^T\}$, where $D^t = \{(X_i^t, Y_i^t, T_i^t)\}_{i=1}^{n_t}$ is the data distribution for task $t$ with $n_t$ sample tuples of input ($X^t \in \mathcal{X}$), target label ($Y^t \in \mathcal{Y}$), and task label ($T^t \in \mathcal{T}$). The goal is to learn a sequential function, $f_\theta : D^t \to \mathcal{Y}^t$, for each task, where $Y^t$ are the predicted labels corresponding to the $t^{th}$ task. Here $f_\theta \in (f_S \cup f_P \cup f_p)$, where $f_S : D^t \to \mathcal{X}_S$, $f_P : D^t \to \mathcal{X}_P$, and $f_p : \mathcal{X}_S \cup \mathcal{X}_P \to Y^t$. We try to achieve this goal by training two separate modules, private and shared, to mitigate forgetting of prior knowledge. The model prevents catastrophic forgetting in the shared and private spaces separately and begins learning $f_\theta^t$, where $\theta \in (\theta_S, \theta_P, \theta_p)$, as a mapping function from $D^t$ to $\mathcal{Y}^t$. We synthesize $n$ samples per class for the classes seen prior to the $t^{th}$ task and add the generated data to the data of the current ($t^{th}$) task to train the model. During training the model with the $t^{th}$ task:

The cross-entropy loss function for the $f_\theta^t$ mapping corresponds to:

Where $\sigma$ is the softmax function. In learning a sequence of tasks, an ideal $f$ maps the input images $X^t$ to their predicted labels $Y^t$.
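For readers who want a concrete reference point, the following is a minimal sketch of a softmax cross-entropy loss over a batch of class scores. It assumes the standard mean negative log-likelihood form; the exact expression used for the task loss is not reproduced in this text, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits_batch, labels):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        probs = softmax(logits)
        total -= math.log(probs[y])
    return total / len(labels)


# With two equal scores the model is maximally unsure between two classes.
uniform_loss = cross_entropy([[0.0, 0.0]], [0])
# A confident, correct prediction gives a loss close to zero.
confident_loss = cross_entropy([[10.0, -10.0]], [0])
```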


Variational autoencoders (VAEs)

Autoencoders can effectively learn the input space and its representation[47, 2]. A VAE is a generative model that follows the encoder, latent vector, decoder architecture of the classical autoencoder, places a prior distribution on the latent space, and uses an evidence lower bound to optimize the learned posterior. The conditional VAE is an extension of the VAE in which data are fed to the network together with class properties such as labels. The variational autoencoder is a fundamental building block of our approach. The variational distribution approximates the true conditional probability distribution over the latent space $z$ by minimizing the distance between the two through a variational lower bound. The loss function for a VAE is:

$$\mathcal{L}_{VAE} = \mathbb{E}_{q_\theta(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\theta(z|x) \,\|\, p(z)\right) \tag{2}$$

where the first term is the reconstruction loss and the second term is the KL divergence between $q_\theta(z|x)$ and $p(z)$. The encoder predicts $\mu$ and $\Sigma$ such that $q_\theta(z|x) = \mathcal{N}(\mu, \Sigma)$, from which a latent vector is sampled via the reparametrization process.

The final objective function of our approach for the $t^{th}$ task is:

$$\mathcal{L}^{t} = \lambda_1 \mathcal{L}_{task} + \lambda_2 \mathcal{L}^{S}_{VAE} + \lambda_3 \mathcal{L}^{P}_{VAE} \tag{3}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are regularizer constants that control the effect of each loss component. The working algorithm of this model is presented in Algorithm 1.
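The pieces of Eq. (2) and Eq. (3) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released implementation: the covariance is taken to be diagonal, the prior is a standard normal, the encoder is assumed to output a log-variance, and the VAE term is written as a quantity to minimize (the negative of the lower bound). All function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vae_loss(recon_log_likelihood, mu, log_var):
    """Negative lower bound: reconstruction term plus KL term."""
    return -recon_log_likelihood + kl_to_standard_normal(mu, log_var)


def total_loss(l_task, l_vae_shared, l_vae_private, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the task loss and the shared/private VAE losses."""
    l1, l2, l3 = lambdas
    return l1 * l_task + l2 * l_vae_shared + l3 * l_vae_private


# A posterior equal to the prior has zero KL; shifting the mean by 1 costs 0.5.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0], [0.0])
# With a tiny variance the sampled latent stays at the predicted mean.
z = reparameterize([1.0, -2.0], [-1000.0, -1000.0], random.Random(0))
```

Writing the sample as a deterministic function of the mean, the variance, and independent noise is what lets gradients flow through the sampling step in an automatic-differentiation framework.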


3.1. Avoid forgetting 


Catastrophic forgetting occurs because of the imbalance in the data between previous and new classes, which biases the network towards the current classes during training, so that models almost forget previous knowledge. One insight of our approach is to decouple the single space learned continually for all tasks into two parts: shared and private subspaces. Another is generative replay: data synthesized by the previous private modules are concatenated with the current task's data while the model is trained on the current task. The first private module sees real data of the first task; the second private VAE sees real data of the second task and synthetic data of the first task. Similarly, while the model is trained on the third task, the third private module gets real data of the third task and synthetic data of the first and second tasks generated from the first and second private decoders, respectively. The single shared module is trained on whatever data each private module sees during training. This continues until the $T^{th}$ task.
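The replay schedule described in this subsection can be sketched as a function that assembles the data seen while training on a given task. The `generate` hook stands in for sampling from a stored private decoder and is a hypothetical placeholder, as are the other names and the toy data.

```python
def build_training_set(task_id, real_data, generate, classes_per_task, n_per_class):
    """Return the (x, y) pairs seen while training on task `task_id` (0-based).

    real_data[t]        : list of (x, y) pairs of real samples for task t
    generate(t, c, n)   : hypothetical hook returning n synthetic inputs of
                          class c from the private decoder stored for task t
    classes_per_task[t] : class labels introduced by task t
    """
    data = list(real_data[task_id])
    for prev in range(task_id):
        for c in classes_per_task[prev]:
            data.extend((x, c) for x in generate(prev, c, n_per_class))
    return data


# Toy usage with a stand-in generator that only tags its outputs.
classes = [[0, 1], [2, 3], [4, 5]]
real = [[(f"real_t{t}_c{c}", c) for c in cs] for t, cs in enumerate(classes)]


def fake_generate(t, c, n):
    return [f"synth_t{t}_c{c}_{i}" for i in range(n)]


first_task_data = build_training_set(0, real, fake_generate, classes, n_per_class=4)
third_task_data = build_training_set(2, real, fake_generate, classes, n_per_class=4)
```

In this sketch the same assembled set would be passed to both the current private module and the shared module, matching the description that the shared module is trained on whatever data each private module sees.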


3.2. Evaluation Metrics

After training on each new task, we evaluate the resulting model on all previous tasks, similar to[25, 8]. To measure our model's performance, we use ACC, the average test classification accuracy across all classes in continual learning. To measure forgetting, we calculate backward transfer (BWT), which indicates how much learning new tasks has influenced performance on previous tasks. BWT < 0 indicates catastrophic forgetting, while BWT > 0 indicates that learning new tasks has helped improve performance on previous tasks.

$$BWT = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( R_{T,i} - R_{i,i} \right) \tag{4}$$

ACC is the mean classification accuracy across all tasks, and $R_{j,i}$ is the test classification accuracy on task $i$ after sequentially finishing learning the $j^{th}$ task.
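Both metrics can be computed from a matrix of accuracies. The sketch below assumes `R[j][i]` holds the test accuracy on task `i` after training on task `j` (0-based), and that ACC averages the final row, that is, accuracy on every task after the last task has been learned. The numbers are made up for illustration and are not results from this paper.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after the final task has been learned."""
    T = len(R)
    return sum(R[T - 1][i] for i in range(T)) / T


def backward_transfer(R):
    """Eq. (4): mean change in accuracy on earlier tasks after the final task."""
    T = len(R)
    if T < 2:
        raise ValueError("BWT needs at least two tasks")
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


# Illustrative accuracies (%) for three tasks; entries above the diagonal
# are unused because those tasks have not been seen yet.
R = [
    [90.0, 0.0, 0.0],
    [85.0, 92.0, 0.0],
    [80.0, 90.0, 95.0],
]
acc = average_accuracy(R)
bwt = backward_transfer(R)
```

Here the accuracy on the first task drops from 90 to 80 and on the second from 92 to 90, so BWT is negative, which signals forgetting.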


4. Experiments 


This section consists of the datasets and baselines we 
used in our experiments and the implementation details. 


Datasets 


We evaluate our approach on the benchmark datasets commonly used for T-Split continual learning, where the entire dataset is divided into T tasks. We use 5-Split MNIST and Permuted MNIST (QMNIST)[20], previously used in[9, 8, 30, 48], 20-Split miniImageNet[45], used in[5, 9, 49], and 20-Split CIFAR100[19], used in[25, 21, 9, 48]. We also run experiments on 5-Split SVHN[29], 5-Split CIFAR10[19], 5-Split Fashion-MNIST[42], and 13-Split not-MNIST (EMNIST)[7]. The datasets' statistics are given in Table 1. We have not used any data augmentation techniques.
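A T-Split benchmark can be built by grouping class labels into tasks. The sketch below assumes that consecutive class labels are assigned to the same task; the actual class ordering used in the experiments is not stated here, and the helper names are hypothetical.

```python
def split_classes(num_classes, num_tasks):
    """Partition class labels 0..num_classes-1 into equally sized tasks."""
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    per_task = num_classes // num_tasks
    return [list(range(t * per_task, (t + 1) * per_task))
            for t in range(num_tasks)]


def split_dataset(samples, num_classes, num_tasks):
    """Route each (x, y) pair to the task that owns label y."""
    tasks = split_classes(num_classes, num_tasks)
    class_to_task = {c: t for t, cs in enumerate(tasks) for c in cs}
    buckets = [[] for _ in range(num_tasks)]
    for x, y in samples:
        buckets[class_to_task[y]].append((x, y))
    return buckets


mnist_tasks = split_classes(10, 5)       # 5-Split: two classes per task
cifar100_tasks = split_classes(100, 20)  # 20-Split: five classes per task
toy_buckets = split_dataset([("a", 0), ("b", 9), ("c", 1)], 10, 5)
```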


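Algorithm 1 Continual Learning
Input: $(X, Y) \sim D^{tr}$
Parameters: $\theta \in (\theta_S \cup \theta_P \cup \theta_p)$
Output: $\mathcal{Y}^t$
1: $D_{gen} = \{\}$
2: for $t \leftarrow 1$ to $T$ do
3:     for $e \leftarrow 1$ to epochs do
4:         Compute $\mathcal{L}_{task}$ using $(X^t, Y^t) \in D^t$
5:         Compute $\mathcal{L}^{S}_{VAE}$ for the shared module using $(X^t, Y^t) \in D^t$
6:         Compute $\mathcal{L}^{P}_{VAE}$ for the $t^{th}$ private module using $(X^t, Y^t) \in D^t$
7:         $\mathcal{L}^{t} = \lambda_1 \mathcal{L}_{task} + \lambda_2 \mathcal{L}^{S}_{VAE} + \lambda_3 \mathcal{L}^{P}_{VAE}$
8:         $\theta \leftarrow \theta - \alpha \nabla \mathcal{L}^{t}$
9:     end for
10:    accuracy $\leftarrow$ function TEST($X^t_{test}$, $Y^t_{test}$, $t$)
11:    for $c \leftarrow 1$ to $C$ do
12:        $C$ is the replay classes.
13:        for $i \leftarrow 1$ to $n$ do
14:            $n$ is the number of samples to be generated per class for the experience replay.
15:            Sample $x_i$ for class $c$ from the corresponding private decoder
16:        end for
17:    end for
18:    $X^{t+1} \leftarrow X^{t+1} \cup X_p$
19: end for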
Dataset   | #Classes | #Tasks | Input Size  | #Train Data (k) | #Test Data (k)
MNIST     | 10       | 5      | 1 x 28 x 28 | 50              | 10
QMNIST    | 10       | 5      | 1 x 28 x 28 | 60              | 50
EMNIST    | 26       | 13     | 1 x 28 x 28 | 100             | 12
F-MNIST   | 10       | 5      | 1 x 28 x 28 | 60              | 10
CIFAR10   | 10       | 5      | 3 x 32 x 32 | 50              | 10
CIFAR100  | 100      | 20     | 3 x 32 x 32 | 50              | 10
SVHN      | 10       | 5      | 3 x 32 x 32 | 61              | 12
mImageNet | 100      | 20     | 3 x 84 x 84 | 50              | 10

Table 1. Statistics of the datasets, where EMNIST = not-MNIST, QMNIST = Permuted MNIST, F-MNIST = Fashion-MNIST, and mImageNet = miniImageNet.


Baselines 


We compare with state-of-the-art approaches, including elastic weight consolidation (EWC)[18], progressive neural networks (PNNs)[35], Hard Attention Mask (HAT)[36], and ACL[9], using the implementations given by[9] unless otherwise stated. We also compare with a few memory-based methods, A-GEM[21], GEM[25], and ER-RES[5], on MNIST, Permuted MNIST (QMNIST), 20-Split CIFAR100, and 20-Split miniImageNet, again relying on the implementation provided by[9]. On Permuted MNIST, the results for SI[48] are taken from[36], those for VCL[30] are taken from[9], and those for uncertainty-based CL in a Bayesian framework (UCB)[8] are taken from the original paper.

Dataset       | Encoder | Decoder | Perceptron
MNIST         | 4 CL    | 4 DCL   | 4 CL and 1 FC
QMNIST        | 4 CL    | 4 DCL   | 4 CL and 1 FC
EMNIST        | 4 CL    | 4 DCL   | 4 CL and 1 FC
Fashion-MNIST | 4 CL    | 4 DCL   | 4 CL and 1 FC
CIFAR10       | 4 CL    | 4 DCL   | 2 CL and 2 FC
CIFAR100      | 4 CL    | 4 DCL   | 2 CL and 1 FC
SVHN          | 4 CL    | 4 DCL   | 2 CL and 2 FC
miniImageNet  | 5 CL    | 5 DCL   | 4 CL and 1 FC

Table 2. The networks used in the experiments (CL = convolution layers, DCL = deconvolution layers, FC = fully connected layers).


Dataset       | z dimension | #parameters
MNIST         | 108         | 199019
QMNIST        | 108         | 199019
EMNIST        | 108         | 562939
Fashion-MNIST | 108         | 199019
CIFAR10       | 192         | 3388428
CIFAR100      | 192         | 13790903
SVHN          | 192         | 3388428
miniImageNet  | 96          | 4107152

Table 3. The second column gives the dimension of the latent space for the shared and private VAEs for all datasets. The third column shows the number of parameters required to train each dataset.


Implementation Details 


The networks used in our model are listed in Table 2 for each dataset. For a given dataset, the architectures of the shared and private modules are the same. We implement our model in PyTorch. We train on each dataset for 50 epochs and evaluate our model's performance at the 25th and 50th epochs. We perform experiments using 4, 20, and 100 synthetic samples per class during continual training. The Adam optimizer[16] is used for all experiments, and the learning rate for the model is 0.0001. The dimension of the latent variable $z$ and the number of parameters used for each dataset are given in Table 3. We set $\lambda_1 = \lambda_2 = \lambda_3 = 1$ in the loss function. We use a Tesla V100 GPU in all our experiments.
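As a rough sanity check, the parameter counts in Table 3 can be converted into an approximate storage size. The sketch assumes 4-byte (float32) parameters, which the text does not state explicitly, and reports both decimal megabytes and mebibytes because the unit convention is also not specified.

```python
def param_memory(num_params, bytes_per_param=4):
    """Return (MB, MiB) for a parameter count; float32 is assumed by default."""
    n_bytes = num_params * bytes_per_param
    return n_bytes / 1e6, n_bytes / 2**20


# Parameter counts copied from Table 3.
PARAMS = {
    "MNIST": 199019,
    "EMNIST": 562939,
    "CIFAR10": 3388428,
    "CIFAR100": 13790903,
    "miniImageNet": 4107152,
}

sizes = {name: param_memory(n) for name, n in PARAMS.items()}
for name, (mb, mib) in sizes.items():
    print(f"{name}: {mb:.2f} MB ({mib:.2f} MiB)")
```

Under these assumptions the counts come out close to the architecture sizes quoted in the results, for example about 0.8 MB for the MNIST-sized model.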


5. Results and Discussion 


In the first set of experiments, we measure ACC, BWT, and the memory used by our method and compare them against state-of-the-art methods on 20-Split miniImageNet, 20-Split CIFAR100, 5-Split Permuted MNIST, and 5-Split MNIST. Next, we report experiments on sequentially learning single datasets: SVHN, Fashion-MNIST, EMNIST, and CIFAR10.


5.1. Performance on 20-Split miniImageNet Dataset 


We divide the miniImageNet dataset into twenty tasks with five classes each. We compare our results with several baselines in Table 4. HAT[36], a regularization-based method with no replay data, achieves ACC = 59.45. A-GEM[21] and ER-RES[5] use an architecture with 25.6M parameters along with memory replay. They store thirteen images of size 84 x 84 x 3 per class during continual training. Reference[9] combines architecture-based and memory-based approaches, outperforms the other baselines, and achieves ACC = 62.07. Our model outperforms all compared models by a large margin and achieves ACC = 100 when we use 100 synthetic samples per class and train it for 50 epochs. The model gives ACC = 77.4 when we use only four generated samples per class and train it for 25 epochs. Our model takes 2900 seconds to learn the 20-Split miniImageNet data when trained for 50 epochs with four generated samples per class. The ACCs for each task are given in Figure 3 for different combinations of the number of training epochs and the number of samples used for generative replay.


Method     | ACC(%)    | BWT(%)   | Arch(MB) | M Replay | G Replay
HAT*[36]   | 59.45     | -0.04    | 123.6    | -        | -
PNN**[35]  | 58.96     | 0.00     | 588      | -        | -
ER-RES*[5] | 57.32     | -11.34   | 102.6    | ✓        | -
A-GEM*[21] | 52.43     | -15.23   | 102.6    | ✓        | -
ORD-FT[9]  | 28.76     | -64.23   | 37.6     | -        | -
ACL[9]     | 62.07     | 0        | 113.1    | ✓        | -
ours       | 100(0.00) | 6.6(0.2) | 16       | -        | ✓

Table 4. Results on 20-Split miniImageNet data measuring ACC(%), BWT(%), and memory size (MB) for the architecture. M Replay: memory replay, i.e., actual data stored. G Replay: generated synthetic data. A positive BWT is good. (*) denotes a result reproduced by[9]. (**) denotes a result obtained using the re-implementation setup by[36]. All results are averaged over 3 runs, and the standard deviation is given in parentheses.


Figure 3. Results for the 20-Split miniImageNet data. E = Number of epochs the model gets trained, S = Number of synthetic data used per class as a generative replay.

5.2. Performance on 20-Split CIFAR100 Dataset

We split the whole dataset into 20 tasks, each containing five classes. We compare our results with other methods in Table 5. ACL[9] is the most competitive baseline: it takes 24.5 MB to save its architecture, stores 13 images per class (1300 images of size 32 x 32 x 3 in total) that require 16 MB of memory, and achieves ACC = 78.08. HAT[36] does not depend on replay examples but needs 27.2 MB for its architecture, in which it learns a task-based attention mask, reaching ACC = 76.96. PNN[35] guarantees zero forgetting. Our model outperforms all the compared models by a large margin and achieves ACC = 99.52 when we train it for 50 epochs and use 100 synthetic samples per class. The model gives ACC = 80.6 if we train it for 25 epochs with only four synthetic samples per class. Our model learns the 20-Split CIFAR100 data in 1540 seconds if trained for 50 epochs with only four synthetic samples per class as generative replay. The ACCs for each task are given in Figure 4 for different combinations of the number of training epochs and the number of samples used for generative replay.

Figure 4. Results for the 20-Split CIFAR100 data. E = Number of epochs the model gets trained, S = Number of synthetic data used per class as a generative replay.

Table 5. Results on 20-Split CIFAR100 data measuring ACC(%), BWT(%), and memory size (MB) for the architecture. M Replay: memory replay, i.e., actual data stored. G Replay: generated synthetic data. A positive BWT is good. (*) denotes a result obtained by[9] using the original provided code. (**) denotes a result obtained using the re-implementation setup by[36]. (°) denotes a result reported by[5]. All results are averaged over 3 runs, and the standard deviation is given in parentheses.


Method     | ACC(%)      | BWT(%)   | Arch(MB) | M Replay | G Replay
HAT*[36]   | 76.96       | 0.01     | 27.2     | -        | -
PNN**[35]  | 75.25       | 0.00     | 93.51    | -        | -
ER-RES°[5] | 54.38       | -21.99   | 25.4     | ✓        | -
A-GEM°[21] | 66.78       | -15.09   | 25.4     | ✓        | -
ORD-FT[9]  | 34.71       | -48.56   | 27.2     | -        | -
ACL[9]     | 78.08       | 0        | 25.1     | ✓        | -
ours       | 99.52(0.59) | 8.5(0.5) | 53       | -        | ✓

5.3. Performance on Permuted MNIST Dataset

Method     | ACC(%)      | BWT(%)       | Arch(MB)
EWC°°[18]  | 88.2        | -            | 1.1
HAT°[36]   | 97.4        | -            | 2.8
UCB°[8]    | 91.44(0.04) | -0.38(0.02)  | 2.2
VCL*[30]   | 88.80(0.23) | -7.90(0.23)  | 1.1
VCL-C*[30] | 95.79(0.10) | -1.38(0.12)  | 1.1
PNN**[35]  | 93.5(0.07)  | Zero         | N/A
ORD-FT[9]  | 44.91(6.61) | -53.690..91) | 1.1
ACL[9]     | 98.03(0.01) | -0.01(0.01)  | 2.4
ours       | 100(0.00)   | 0.00(0.00)   | 0.8

Table 6. Results on Permuted MNIST data measuring ACC(%), 
BWT(%), and memory size(MB) for architecture. BWT positive 
means good. (°°) denotes result is reported by[36]. (°) denotes 
result is reported by original work. (*) denotes result is obtained 
by[9] using original code. (**) denotes result is reported by[5]. All 
results are averaged over 3 runs and standard deviation is given in 
parentheses. 


One of the popular variants of the MNIST dataset in continual learning is Permuted MNIST. We divide the dataset into five tasks, each containing two classes. We compare our approach with other methods in Table 6. HAT[36] achieves ACC = 91.6 using an architecture of size 1.1 MB. Vanilla VCL[30] improves ACC and BWT by 7 and 6.5, respectively, when using a k-means core-set memory of 200 samples per task (6.3 MB) and an architecture size of 1.1 MB. PNN[35] achieves ACC = 93.5 with zero forgetting. ACL[9] achieves ACC = 98.03, takes 0.2 MB for the 55k parameters added for each task, and occupies a total of 2.5 MB of memory. Our model outperforms all compared methods and achieves ACC = 100 when we train it for 50 epochs with four synthetic samples per class as generative replay, and it occupies 0.8 MB for its architecture. If we train the model for 25 epochs with only four synthetic samples per class, it gives ACC = 98.12. The model learns the Permuted MNIST dataset in 1300 seconds when trained for 50 epochs with the same number of synthetic samples. The ACCs for each task are given in Figure 5 for different combinations of the number of training epochs and the number of samples used for generative replay.


Figure 5. Results for the Permuted MNIST data. E = Number of epochs the model gets trained, S = Number of synthetic data used per class as a generative replay. For the other combinations ((E:25, S:100), (E:50, S:20), and (E:50, S:100)), ACC = 100% for all tasks, so they are not plotted.


5.4. Performance on 5-Split MNIST Dataset 


Method     | ACC(%)      | BWT(%)       | Arch(MB)
EWC[18]    | 95.78(0.35) | -4.2(0.21)   | 1.1
HAT°°[36]  | 99.59(0.01) | 0.00(0.04)   | 1.1
UCB°[8]    | 99.63(0.02) | 0.00(0.00)   | 2.2
VCL*[30]   | 95.97(1.03) | -4.62(1.28)  | 1.1
VCL-C*[30] | 93.6(0.20)  | -3.10(0.20)  | 1.7
GEM*[25]   | 94.34(0.82) | -2.01(0.05)  | 6.5
ORD-FT[9]  | 65.96(3.53) | -40.15(4.27) | 1.1
ACL[9]     | 99.76(0.03) | 0.01(0.01)   | 1.6
ours       | 100(0.00)   | 0.00(0.00)   | 0.8

Table 7. Results on 5-Split MNIST data measuring ACC(%), BWT(%), and memory size (MB) for the architecture. A positive BWT is good. (°°) denotes a result reported by[8]. (°) denotes a result taken from the original work. (*) denotes a result obtained by[9] using the original provided code. All results are averaged over 3 runs, and the standard deviation is given in parentheses.


We divide the MNIST dataset into five tasks, each consisting of two classes. We compare our results with other existing models in Table 7. The table includes EWC, HAT, UCB, and vanilla VCL, which are regularization-based methods with no memory replay, as well as a method relying on memory only (GEM) and VCL with a k-means core-set (VCL-C), where 40 samples are stored per task. ACL gives ACC = 99.76 with zero forgetting, outperforming UCB, which reaches ACC = 99.63 and uses 40% more memory. ACL uses only architecture growth (no experience replay), where 54.3k private parameters are added for each task, resulting in a memory requirement of 1.6 MB for all private modules. ACL's architecture has a total of 420.1k parameters. Our method outperforms all compared models, achieves ACC = 100 with BWT = 0 when we train it for 50 epochs with only four synthetic samples per class as generative replay, and learns the MNIST dataset in 660 seconds. The ACCs for each task are given in Figure 6 for different combinations of the number of training epochs and the number of samples used for generative replay.

Figure 6. Results for the MNIST data. E = Number of epochs the model gets trained, S = Number of synthetic data used per class as a generative replay. For the other combinations ((E:25, S:100), (E:50, S:20), and (E:50, S:100)), ACC = 100% for all tasks, so they are not plotted.


5.5. Performance on SVHN Dataset

We divide the SVHN dataset into five tasks, each consisting of two classes. Our model learns the whole dataset continually in around 800 seconds when trained for 50 epochs with only four synthetic samples per class as generative replay. The network has a total of 3388428 parameters, which occupy 13 MB of memory. The model gives ACC = 100. The model's performance on each task is given in Figure 7 for a few combinations of the number of training epochs and the number of samples used for generative replay.

Figure 7. Results for the SVHN data. E = Number of epochs the model gets trained, S = Number of synthetic data used per class as a generative replay.


5.6. Performance on Fashion-MNIST dataset 


We divide the Fashion-MNIST dataset into five tasks, each consisting of two classes. Our model gives ACC = 100 and learns the whole dataset continually in around 600 seconds when trained for 50 epochs with only four synthetic samples per class as generative replay. The network has a total of 199019 parameters, which occupy 0.8 MB of memory. The model's performance on each task is given in Figure 8 for a few combinations of the number of training epochs and the number of samples used for generative replay.


Figure 8. Results for the Fashion-MNIST data. E = Number of 
epochs the model gets trained, S = Number of synthetic data used 
per class as a generative replay. 


5.7. Performance on EMNIST Dataset

We divide the EMNIST dataset into thirteen tasks, each consisting of two classes. Our model gives ACC = 100 and learns the whole dataset continually in around 1800 seconds when trained for 50 epochs with only four synthetic samples per class as generative replay. The network has a total of 562939 parameters, which occupy 2 MB of memory. The model's performance on each task is given in Figure 9 for a few combinations of the number of training epochs and the number of samples used for generative replay.

Figure 9. Results for the EMNIST data. E = Number of epochs the model gets trained, S = Number of synthetic data used per class as a generative replay.


5.8. Performance on CIFAR10 Dataset

We divide the CIFAR10 dataset into five tasks, each consisting of two classes. Our model gives ACC = 100 and learns the whole dataset continually in around 800 seconds when trained for 50 epochs with only four synthetic samples per class as generative replay. The network has a total of 3388428 parameters, which occupy 13 MB of memory. The model's performance on each task is given in Figure 10 for a few combinations of the number of training epochs and the number of samples used for generative replay.

Figure 10. Results for the CIFAR10 data. E = Number of epochs the model gets trained, S = Number of synthetic data used per class as a generative replay.


6. Conclusion 


In this paper, we propose a novel hybrid continual learning algorithm that grows in size with each task and factorizes the representation learned for a sequence of tasks into task-invariant and task-specific subspaces. The learning method combines generative replay with an architecture-based approach. We show that it performs well on datasets with more classes or more tasks, such as miniImageNet and CIFAR100, and that it establishes a new state of the art on continual learning benchmark datasets. In future work, we are interested in extending this work to class-incremental learning.